PROCESSORS AND METHODS OF OPERATING PROCESSORS

The present invention relates to processors and methods
of operating processors. The invention has particular
application in parallel pipelined processors, such as
very long instruction word (VLIW) processors.

High performance processors use a technique known as
pipelining to increase the rate at which instructions
can be processed. Pipelining works by executing an
instruction in several phases, with each phase being
executed in a single pipeline stage. Instructions flow
through successive pipeline stages, with all
partially-completed instructions moving one stage
forward on each processor clock cycle. Instructions
complete execution when they reach the end of the
pipeline.

Processors attempt to keep pipelines full at all times,
thus ensuring a high rate of instruction completion.
However, it is possible that an instruction may not be
able to progress through one of the stages of a
pipeline in a single clock cycle for some reason, for
example, because it needs to access slow memory or to
compute a multi-cycle operation. Such an event is
known as a stall. When stage i of a pipeline stalls it
prevents the instruction at stage i-1 from making
forward progress, even if the instruction at stage i-1
is not itself stalled. This in turn stalls stage i-2,
and so on up to stage 0 (the first stage). When there
is a stall at stage i, a signal flows to all stages
from 0 to i-1 in the pipeline to cause them to stall
before the next active edge of the pipeline clock.



Some processor architectures provide two or more
parallel pipelines for processing different
instructions (or different parts of an instruction)
simultaneously. In this case, the stall signal must be
distributed to all pipelines to ensure that
instructions which are issued in parallel also complete
in parallel. However, the delay of propagating such a
global stall signal may restrict the operating clock
frequency of the processor. Furthermore, the distance
such a signal would have to travel would grow as more
pipelines were added. Hence a processor having more
pipelines would need a lower clock frequency, thus
defeating the high throughput objective of adding
further pipelines.

The present invention seeks to overcome the above
disadvantages.

According to a first aspect of the present invention
there is provided a processor comprising:

a plurality of pipelines, each pipeline having a
plurality of pipeline stages for executing an
instruction on successive clock cycles; and

stalling means for stalling the execution of
instructions in all of the pipelines in response to a
stall signal generated in any one of the pipelines;

wherein the stalling means is adapted to stall the
execution of instructions in different pipelines in
different clock cycles in response to the stall signal.

Stalling the execution of instructions in different
pipelines in different clock cycles may make additional
time available for distributing signals, such as global
stall signals. This may allow the processor to operate
at a higher speed than would otherwise be the case.



Preferably the stalling means is adapted to stall the
execution of an instruction in a pipeline not
generating the stall signal at least one clock cycle
later than the execution of an instruction in a
pipeline generating a stall signal. This may allow at
least one clock cycle for distributing the stall signal
from the pipeline generating the stall signal to other
pipelines.

Preferably the stalling means is adapted to release the
stall in the pipeline not generating the stall signal
at least one clock cycle later than the stall in the
pipeline generating the stall signal. In this way, the
instructions in the various pipelines may return to
alignment after being out of step for one or more clock
cycles.

If two or more pipelines generate a stall signal
simultaneously, then it may not be necessary to stall
each pipeline in response to the stall signal generated
by the other pipeline. This is because each pipeline
may have already implemented the appropriate stall. If
each pipeline were also stalled in response to the
stall signal generated by the other pipeline, more
stalls than necessary may be implemented. Thus the
stalling means may be arranged such that, when a
pipeline stage in a first pipeline receives a stall
signal from a second pipeline, the execution of
instructions in the pipeline stage is not stalled if
the pipeline stage stalled in the previous cycle in
response to a stall signal generated by the first
pipeline.






The stalling means may be arranged such that, when a
pipeline generates a stall signal at a stage i, all
stages up to and including stage i of that pipeline are
stalled. In this way, earlier stages in a pipeline are
stalled to prevent instructions in a pipeline from
overwriting each other. Later stages need not be
stalled, so the instructions in later stages may
continue to progress through the pipeline as normal,
although, if required, some or all of the later stages
could also be stalled.

The stalling means may be arranged such that, when a
pipeline generates a stall signal at a stage i, all
stages up to and including stage i of that pipeline are
stalled on a given clock cycle, and all stages up to
and including stage i+m of a pipeline not generating a
stall signal are stalled m clock cycles later than said
given clock cycle, where m is an integer greater than
or equal to 1. By providing that all stages up to i+m
are stalled m clock cycles later in a pipeline not
generating a stall signal, the instructions in that
pipeline which correspond to the stalled instructions
in the pipeline generating the stall signal are
stalled.
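
The relationship between i, m and the stalled stages can be
illustrated with a small sketch. The function name
stall_schedule and its parameter names are hypothetical and
are used here only for illustration; the sketch simply
restates the rule above and is not a description of any
particular circuit.

```python
def stall_schedule(stall_stage, delay, cycle):
    """Return which stages stall, and when, in each kind of pipeline.

    stall_stage: stage i that generates the stall signal
    delay:       m, the number of clock cycles allowed for distributing
                 the stall signal (an integer >= 1)
    cycle:       clock cycle in which the generating pipeline stalls
    """
    if delay < 1:
        raise ValueError("delay m must be an integer >= 1")
    return {
        # Generating pipeline: stages 0..i stall on the given cycle.
        "generating": (cycle, list(range(stall_stage + 1))),
        # Any other pipeline: stages 0..i+m stall m cycles later.
        "other": (cycle + delay, list(range(stall_stage + delay + 1))),
    }


schedule = stall_schedule(stall_stage=3, delay=1, cycle=10)
```

With i = 3 and m = 1, the other pipeline stalls one stage
further (stage 4) because its instructions have advanced by
one stage during the cycle in which the signal was in transit.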






The processor may comprise a plurality of pipeline 
clusters, each cluster comprising a plurality of 
pipelines. In this case, the stalling means may be 
arranged to stall execution of instructions in 
pipelines within a cluster in the same clock cycle. 






Preferably, in operation, instructions entering the
respective pipelines in parallel (that is to say, in a
particular clock cycle) exit the pipelines in parallel.



The present invention also has application in the
distribution of signals other than stall signals, and
thus, according to another aspect of the invention,
there is provided a processor comprising a plurality of
pipelines, each pipeline having a plurality of pipeline
stages for executing an instruction on successive clock
cycles, the processor being adapted to allow two
instructions which enter and exit respective pipelines
in parallel to become out of step with each other for
at least one clock cycle.



According to another aspect of the invention there is
provided a processor comprising a plurality of
pipelines, each pipeline having a plurality of pipeline
stages, at least some pipeline stages having a
processing circuit arranged to perform the following
functions:

receiving a processor clock signal;

processing instructions on successive cycles of
said processor clock signal;

generating a stall signal when the processing
circuit requires the processing of an instruction to be
stalled; and

stalling the processing of an instruction in
response to an externally-generated stall signal;

the processor also including control logic
arranged to cause the execution of instructions in
different pipelines to stall in different cycles in
response to the same stall signal.

Analogous method aspects of the invention are also
provided. Apparatus features of the invention may be
applied to method aspects and vice versa.

Preferred features of the present invention will now be
described, purely by way of example, with reference to
the accompanying drawings, in which:

Figure 1 is a block diagram of parts of a
processor according to a first embodiment of the
present invention;

Figure 2 is a representation of a seven stage
pipeline;

Figure 3 shows parts of two stages of first and
second pipelines in the Figure 1 processor;

Figure 4 is a block diagram of parts of a
processor according to a second embodiment of the
present invention;

Figure 5 shows parts of one stage of a pipeline in
the Figure 4 processor;

Figure 6 is a state transition diagram showing the
operation of part of the pipeline stage of Figure 5;

Figures 7 to 10 show one example of the operation
of the second embodiment; and

Figures 11 to 14 show another example of the
operation of the second embodiment.

Figure 1 is a block diagram of parts of a processor
according to a first embodiment of the present
invention. In this embodiment, the processor is a very
long instruction word (VLIW) processor which is
designed to execute long instructions which may be
divided into smaller instructions.

Referring to Figure 1, processor 1 comprises
instruction issuing unit 10, schedule storage unit 12,
first and second execution units 14, 16, and first and
second register files 18, 20. The instruction issuing
unit 10 has two issue slots IS1, IS2 connected
respectively to the execution units 14, 16. The first
execution unit 14 is connected to the first register
file 18 and the second execution unit 16 is connected to
the second register file 20. The register files 18, 20
are connected to each other via a bus 22. As an
alternative to two register files connected via a bus,
a single register file could be provided instead. Each
of the execution units 14, 16 is also connected to an
external memory 24 via bus 26. In this example the
external memory 24 is random access memory (RAM),
although it may be any other type of memory.



In operation, an instruction packet for execution is
passed from the schedule storage unit 12 to the
instruction issuing unit 10. The instruction issuing
unit 10 divides the instruction packet into its
constituent instructions, and issues the two
instructions to the execution units 14, 16 via the
issue slots IS1 and IS2 respectively. The execution
units 14, 16 then execute the various instructions
simultaneously. In this way, different parts of a long
instruction are processed in parallel.

Each of the execution units 14, 16 uses a pipelining
technique to maximise the rate at which it processes
instructions. Pipelining works by implementing each of
a plurality of phases of instruction execution as a
single pipeline stage. Instructions flow through
successive pipeline stages, in a production-line
fashion, with all partially-completed instructions
moving one stage forward on each processor clock cycle.
Instructions complete execution when they reach the end
of the pipeline.

Figure 2 is a representation of a seven stage pipeline.
In this representation, the content of each stage is
the sequence number of the instruction which has
reached that stage of the pipeline. Instructions in
the pipeline flow from left to right, from stages 0 to
7.

It is desirable to keep the pipeline full at all times,
thus ensuring a high rate of instruction completion.
However, it is possible that an instruction may not be
able to progress through one of the stages in a single
clock cycle for some reason, for example because it
needs to access slow memory or to compute a multi-cycle
operation. Such an event is known as a stall. When
stage i stalls it prevents the instruction at stage i-1
from making forward progress, even if the instruction
at stage i-1 is not itself stalled. This in turn
stalls stage i-2 and so on up to stage 0 (the first
stage). If stages 0 to i are stalled at time T, then
at time T+1 stage i+1 will have no instruction to
process. If the stall persists for another cycle, then
at time T+2 stages i+1 and i+2 will have no
instructions to process. These empty pipeline stages
are known as bubbles. In Figure 2, a two-cycle bubble
is shown in stages 3 and 4.
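
The formation of bubbles can be illustrated with a minimal
software model of a single pipeline. This is a sketch only:
the function step, the list representation (index = stage,
value = instruction sequence number, None = bubble) and the
starting contents are assumptions made for illustration and
are not taken from the figures.

```python
def step(pipeline, stall_stage, next_seq):
    """Advance a pipeline by one clock cycle.

    pipeline:    list indexed by stage; each entry is an instruction
                 sequence number, or None for a bubble
    stall_stage: index i of the stalling stage, or None if no stall
    next_seq:    sequence number of the next instruction to be issued
    Returns the new pipeline contents and the updated next_seq.
    """
    if stall_stage is None:
        # Normal operation: every instruction moves one stage forward.
        return [next_seq] + pipeline[:-1], next_seq + 1
    # Stages 0..i hold their instructions, stage i+1 receives nothing
    # (a bubble), and later stages continue to advance.
    held = pipeline[:stall_stage + 1]
    advanced = pipeline[stall_stage + 1:-1]
    new = (held + [None] + advanced)[:len(pipeline)]
    return new, next_seq


# Seven stages, stage 0 holding the most recently issued instruction.
pipe = [6, 5, 4, 3, 2, 1, 0]
seq = 7
for _ in range(2):  # stage 2 stalls for two consecutive cycles
    pipe, seq = step(pipe, stall_stage=2, next_seq=seq)
```

After the two stalled cycles, stages 0 to 2 still hold
instructions 6, 5 and 4, while stages 3 and 4 are empty,
giving a two-cycle bubble of the kind described above.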



VLIW processors are designed such that instructions
which are issued to different pipelines in parallel
also complete their execution in parallel. If this
rule were relaxed it would prove very difficult to stop
a running process cleanly and to restart it at some
later time. Thus, if there is a stall in one pipeline,
each other pipeline must also stall to ensure that the
various instructions exit the pipelines in parallel.
For example, if there is a stall in the pipeline in
execution unit 14 in Figure 1, then the pipeline in
execution unit 16 must also stall.

When there is a stall at stage i of a pipeline, a stall
signal is generated by that stage. This stall signal
is distributed to all stages from 0 to i-1 in the
pipeline to cause them to stall before the next active
edge of the pipeline clock.

One possible scheme for stalling the other pipeline in
the Figure 1 processor would be to take the logical OR
of the stall signal from each stage of both pipelines
and to distribute the result to both pipelines.
However, this would require the transmission of a
global stall signal to all pipeline stages before the
next active edge of the pipeline clock. For high speed
processors the delay in propagating such a global stall
signal may restrict the operating clock frequency of
the processor. Furthermore, if more pipelines were
added in order to increase the processing rate, the
physical distance the global stall signal would have to
travel would increase, thereby restricting the
operating clock frequency even further.
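
For comparison, the single global OR scheme just described
can be sketched as follows. The function name global_stall
and the nested-list representation of the per-stage stall
signals are hypothetical. The sketch only shows the logical
function; the drawback discussed above is a timing one,
namely that the result must reach every stage within the
same clock period.

```python
def global_stall(stall_signals):
    """Logical OR of the stall signal from every stage of every pipeline.

    stall_signals: one list per pipeline, each holding one boolean
                   per pipeline stage
    """
    return any(any(stage_signals) for stage_signals in stall_signals)


# Stage 1 of the second pipeline requests a stall.
stalled = global_stall([[False, False, False], [False, True, False]])
idle = global_stall([[False, False, False], [False, False, False]])
```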

According to the first embodiment, instructions passing
through the respective pipelines may get one stage out
of step with each other, thus providing a full clock
period for a stall signal to pass from one pipeline to
the other.

The operation of the first embodiment will now be
described with reference to Figure 3. Figure 3 shows
two stages, stage i and stage i+1, of a first pipeline
(pipeline 1) and the corresponding stages of a second
pipeline (pipeline 2). Stage i of pipeline 1 comprises
instruction register 40, processing circuit 42, OR gate
44, registers 45, 46, AND gate 47 and OR gate 48; stage
i+1 of pipeline 1 comprises instruction register 50,
processing circuit 52, OR gate 54, registers 55, 56,
AND gate 57 and OR gate 58; stage i of pipeline 2
comprises instruction register 60, processing circuit
62, OR gate 64, registers 65, 66, AND gate 67 and OR
gate 68; stage i+1 of pipeline 2 comprises instruction
register 70, processing circuit 72, OR gate 74,
registers 75, 76, AND gate 77 and OR gate 78. In this
example, registers 45, 55, 65, 75 and 46, 56, 66, 76
are implemented as D-type flip-flops. All clock inputs
are fed by a common clock signal. Stages i and i+1 of
pipeline 1 are part of the execution unit 14 in Figure
1 and stages i and i+1 of pipeline 2 are part of the
execution unit 16 in Figure 1.

In normal operation, each processing circuit 42, 52,
62, 72 executes one phase of an instruction held in the
corresponding register 40, 50, 60, 70. The stage i
processing circuits 42 and 62 execute in parallel phase
i of two instructions belonging to one VLIW packet and
the stage i+1 processing circuits 52 and 72 execute
simultaneously phase i+1 of two instructions belonging
to another VLIW packet. On each clock cycle, the
instructions held in the registers are passed to the
next registers in the pipelines for further processing.
In this way instructions flow through the pipelines
with all partially completed instructions moving
forward one stage on each clock cycle.

Each of the processing circuits 42, 52, 62, 72 is able
to assert a stall signal if it requires the progress of
the instruction it is executing to be stalled on the
next clock cycle. The stall signals from processing
circuits 42, 52, 62, 72 are fed to OR gates 44, 54, 64,
74 respectively. The outputs of the OR gates 44, 54,
64, 74 are fed to OR gates 48, 58, 68, 78 respectively.
The OR gates 48, 58, 68, 78 output hold signals to
processing circuits 42, 52, 62, 72 respectively. Thus,
if one of the processing circuits asserts a stall
signal, the hold signal of that processing circuit is
asserted via the corresponding OR gates. If the hold
signal input to a processing circuit is set, then that
processing circuit will stall on the next clock cycle.

The output of each of the OR gates 44, 54, 64, 74 is
also fed as a ripple signal to the corresponding OR
gate in the previous stage in the same pipeline. The
ripple signals thus ripple down the pipelines, so that
if a processing circuit asserts a stall signal, the
hold signals input to all previous processing circuits
in the same pipeline are also asserted.



The output of each of the OR gates 44, 54, 64, 74 is
also fed to the next stage of the other pipeline as a
global signal. Each of the registers 46, 56, 66, 76
receives such a global signal from the previous stage
of the other pipeline. For example, the output of OR
gate 44 is fed to register 76 and the output of OR gate
64 is fed to register 56.

Each of the registers 46, 56, 66, 76 delays the signal
at its input until the next clock cycle. Thus, the
output of each of the registers 46, 56, 66, 76 is the
global signal from the previous stage of the other
pipeline, delayed until the next clock cycle. The
outputs of the registers 46, 56, 66, 76 are fed via
respective AND gates 47, 57, 67, 77 to respective OR
gates 48, 58, 68, 78, which output hold signals to
processing circuits 42, 52, 62, 72 respectively. Thus,
assuming the other inputs to the AND gates 47, 57, 67,
77 are set, if a processing circuit asserts a stall
signal, the hold signal input to the next processing
circuit in the other pipeline will be asserted in the
next clock cycle.

The stall signals of processing circuits 42, 52, 62, 72
are also fed to registers 45, 55, 65, 75 respectively.
The inverting outputs of registers 45, 55, 65, 75 are
fed to AND gates 47, 57, 67, 77 respectively. Thus,
the output of each of the AND gates 47, 57, 67, 77 is
reset if the corresponding processing circuit asserted
a stall signal in the previous clock cycle, regardless
of the state of registers 46, 56, 66, 76. This
prevents the registers 46, 56, 66, 76 from generating a
stall signal if the corresponding pipeline stage
stalled in the previous clock cycle due to a locally
generated stall signal. In this way, if the two
pipelines generate a stall signal in the same clock
cycle, only one stall is implemented.
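
The combinational part of the hold logic for one stage can
be summarised as a boolean function. This is a sketch under
the assumption that each gate behaves as an ideal logic gate;
the function and argument names are hypothetical, and the
comments use the stage i, pipeline 1 reference numerals for
orientation only.

```python
def hold_signal(stall, ripple_in, delayed_global, stalled_prev_cycle):
    """Hold signal for one pipeline stage of the first embodiment.

    stall:              stall signal asserted by this stage's own
                        processing circuit in the current cycle
    ripple_in:          ripple signal from the next stage of the same
                        pipeline
    delayed_global:     global signal from the previous stage of the
                        other pipeline, delayed by one clock cycle
                        (output of register 46)
    stalled_prev_cycle: this stage's own stall signal from the previous
                        cycle (contents of register 45)
    """
    local = stall or ripple_in                         # OR gate 44
    gated = delayed_global and not stalled_prev_cycle  # AND gate 47
    return local or gated                              # OR gate 48


# Both pipelines asserted a stall in the previous cycle: the delayed
# global signal is masked, so no second stall is implemented.
masked = hold_signal(False, False, True, True)
# Only the other pipeline stalled: the delayed signal takes effect.
passed = hold_signal(False, False, True, False)
```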

As an example, if processing circuit 42 sets the stall
signal in a given clock cycle, then OR gates 44 and 48
ensure that the hold signal of that stage is set. A
ripple signal flows to all stages i-1 to 0 via the
corresponding OR gates of those stages, causing each of
those stages to set the hold signal. Thus, a hold
signal is set at all stages 0 to i of the first
pipeline before the next active edge of the pipeline
clock, causing those stages to stall on the next clock
cycle. However, because the output of OR gate 44 is
delayed by register 76, the hold signal input to
processing circuit 72 is not set until the next but one
active edge of the pipeline clock. Similarly, the hold
signal input to processing circuit 62 is delayed by one
clock cycle due to register 66, and so on until stage 1
of the second pipeline. Thus, stages 0 to i+1 of the
second pipeline stall one clock cycle later than stages
0 to i of the first pipeline. It is necessary to stall
stages 0 to i+1 of the second pipeline because the
instructions in that pipeline will have advanced one
stage while the instructions in the first pipeline are
stalled. Stage 0 of a pipeline is arranged to stall
when stage 1 of that pipeline stalls.

While the stall signal from processing circuit 42
remains set, the hold signals to stages 0 to i of the
first pipeline and stages 0 to i+1 of the second
pipeline remain set, causing those stages to remain
stalled. When the stall signal is released, the hold
signals to stages 0 to i of the first pipeline are also
released before the next active edge of the clock, so
that normal operation resumes in that pipeline on the
next clock cycle. However, due to the operation of the
registers 66, 76 (and the corresponding registers in
previous pipeline stages), the hold signals to stages 0
to i+1 of the second pipeline are only released on the
next but one clock cycle. Thus in the second pipeline
normal operation resumes on the next but one clock
cycle. This one cycle delay in releasing the stall
allows the two pipelines to get back into step with
each other.
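
The way the two pipelines drift apart by one stage and then
realign can be checked with a behavioural model. This is a
sketch under simplifying assumptions: it models only which
stages hold on each clock edge, not the gates and registers
of Figure 3, and the names step and simulate, the six-stage
depth and the starting contents are chosen for illustration.

```python
def step(pipe, hold_upto, next_seq):
    """Advance one pipeline by a clock cycle.

    Stages 0..hold_upto hold (None means no stall); the stage after the
    held region receives a bubble (None) and later stages advance.
    """
    if hold_upto is None:
        return [next_seq] + pipe[:-1], next_seq + 1
    new = (pipe[:hold_upto + 1] + [None] + pipe[hold_upto + 1:-1])[:len(pipe)]
    return new, next_seq


def simulate(n_stages, stall_stage, stall_cycles, total_cycles):
    """Pipeline 1 stalls at stall_stage for stall_cycles cycles.

    Pipeline 2 stalls one stage further, one cycle later, and is
    released one cycle later. Returns the contents of both pipelines
    after each clock edge.
    """
    p1 = list(range(n_stages - 1, -1, -1))  # stage 0 holds the newest
    p2 = list(p1)
    seq1 = seq2 = n_stages
    history = []
    for t in range(total_cycles):
        hold1 = stall_stage if t < stall_cycles else None
        hold2 = stall_stage + 1 if 1 <= t < stall_cycles + 1 else None
        p1, seq1 = step(p1, hold1, seq1)
        p2, seq2 = step(p2, hold2, seq2)
        history.append((p1, p2))
    return history


history = simulate(n_stages=6, stall_stage=2, stall_cycles=1, total_cycles=3)
```

In this run the pipelines differ after the first clock edge
(instruction 3 is at stage 2 in the first pipeline but at
stage 3 in the second) and hold identical contents again
after the second edge, so corresponding instructions still
leave the final stage in the same cycle.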

Thus it will be seen that individual instructions
within the two pipelines may become out of step by one
cycle, thereby providing a full clock period for stall
signals to propagate between the pipelines.

The first embodiment has the following main features:

1. Either of the pipelines can assert a stall signal
   in any cycle that is not itself the subject of a
   stall.

2. Stages prior to a stalling stage that are in the
   same pipeline will stall in the same cycle as the
   stalling stage, but stages in the other pipeline
   will be stalled one cycle later.

3. When a stall is released, all stalled stages in
   that pipeline make forward progress immediately,
   but stalled stages in the other pipeline are
   released one cycle later. The net effect of this
   is that the stalled operations return to alignment
   after being out of step by one stage during the
   stall.

4. The logic to control the stalling of each stage is
   local to that stage. A full clock cycle is
   available to distribute the global stall signal to
   the other pipeline.



It will be noted that the last stage of a pipeline is
not able to assert a stall signal, because it would not
then be able to stall the other pipeline in time to
prevent the instruction in the last stage from exiting
the pipeline. This may be dealt with either by
arranging the processor and the instruction set such
that the last stage of a pipeline never needs to assert
a stall signal, or by adding a final dummy stage to
each pipeline.



If desired, a delay of two or more clock cycles may be
introduced for the propagation of stall signals between
the pipelines, by arranging the registers 45, 55, 65,
75 and 46, 56, 66, 76 to delay the signals at their
inputs by two or more clock cycles. For example, two
or more D-type flip-flops connected in series may be
used in place of each flip-flop 45, 46, 55, 56, 65, 66,
75, 76 in Figure 3. In this case, the processor may
ensure that the last m stages of the pipeline do not
assert a stall signal, where m is the number of clock
cycles delay, or a number m of dummy stages may be
added, or a combination of the two approaches may be
used.

Figure 4 shows parts of a processor according to a
second embodiment. Referring to Figure 4, processor
100 comprises an instruction issuing unit 102, a
schedule storage unit 104, first to eighth execution
units (E.U.) 106, 108, 110, 112, 114, 116, 118, 120 and
first to fourth register files 122, 124, 126, 128. The
instruction issuing unit 102 has eight issue slots IS1
to IS8 connected respectively to the execution units
106 to 120. The first and second execution units 106,
108 are connected to the first register file 122, the
third and fourth execution units 110, 112 are connected
to the second register file 124, the fifth and sixth
execution units 114, 116 are connected to the third
register file 126, and the seventh and eighth execution
units 118, 120 are connected to the fourth register
file 128. Each of the execution units is also
connected to an external memory 132, such as a RAM
device, via bus 134.

In operation, an instruction packet (VLIW packet) for
execution is passed from the schedule storage unit 104
to the instruction issuing unit 102. The instruction
issuing unit 102 divides the instruction packet into
its constituent instructions, and issues the
instructions to the execution units 106 to 120 via
issue slots IS1 to IS8. The execution units 106 to
120 then execute the various instructions belonging to
the packet simultaneously.

The execution units 106 to 120 are divided into four
groups, with each group having its own register file.
This is done in order to reduce the number of access
slots to any one register file. If a single register
file were provided, the register file may have too many
access slots, which would increase access time to the
register file. Each group of execution units with a
common register file may be referred to as a cluster.
In Figure 4, a first cluster is formed by execution
units 106, 108 and register file 122, a second cluster
is formed by execution units 110, 112 and register file
124, a third cluster is formed by execution units 114,
116 and register file 126, and a fourth cluster is
formed by execution units 118, 120 and register file
128. If a value held in one cluster is required in
another cluster, the value is transferred between the
clusters via a bus 130. While four clusters are shown
in Figure 4, more or fewer clusters may be provided as
required. Each cluster may have one, two or more
execution units.

As in the first embodiment, each of the execution units
106 to 120 uses pipelining to increase the rate at
which it processes instructions. Pipelines within a
cluster stay in step with each other, whereas pipelines
in different clusters are allowed to get one or more
cycles out of step with each other.

Figure 5 shows the block structure of the ith stage in
a two pipeline cluster. Stage i of the cluster
comprises stage i of the first pipeline in the cluster
(pipeline 0), stage i of the second pipeline in the
cluster (pipeline 1), and common control circuitry for
controlling the two pipeline stages. Stage i of
pipeline 0 comprises register 140 and processing
circuit 142, and stage i of pipeline 1 comprises
register 144 and processing circuit 146. The control
circuitry is formed by stall control logic 148,
registers 150, 152, 154 and OR gate 156.

In operation, stage i of the cluster can be either
active, or stalled by a stage j in the same cluster,
where j ≥ i, or stalled by a stage l in another cluster,
where l ≥ i-1. Hence, stage i of the cluster has a state
variable CurrentState i, which indicates whether the
stage is active (A), locally stalled (L) or globally
stalled (G).

The value of the variable CurrentState i is held in
register 150. The input to register 150 is a signal
NextState i which determines which state the stage will
be in in the next clock cycle.



The behaviour of the stall control logic 148 is
governed by a set of state transitions, as shown in
Table 1. In Table 1, each row represents a possible
transition that can be made, and explicitly states the
conditions under which it can take place. Each
transition is from the current state to the next state,
and will take place if the stage in question is in the
current state for a given transition and the values of
the boolean variables local and global_in are as
indicated in the entry for that transition. The local
variable is set when there is a local stall, i.e. one
that originates in the local cluster. This is true
whenever the current stage has a pending stall or if
any stage after the current stage will be in the
locally stalled state in the next cycle. The global_in
variable is set if any other cluster asserted its local
signal in the previous cycle.



Table 1

Transition   Current State   global_in   local   Next State
    1              A             0         0         A
    4              A             0         1         L
    6              A             1         0         G
    6              A             1         1         G
    5              L             0         0         A
    3              L             0         1         L
    5              L             1         0         A
    3              L             1         1         L
    7              G             0         0         A
    8              G             0         1         L
    2              G             1         0         G
    2              G             1         1         G
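
The transitions of Table 1 can be written down directly as a
lookup table. The sketch below is illustrative only; the
names NEXT_STATE and next_state are hypothetical, and the
keys are (current state, global_in, local) in the same order
as the table columns.

```python
# (current state, global_in, local) -> next state, following Table 1.
NEXT_STATE = {
    ("A", 0, 0): "A", ("A", 0, 1): "L", ("A", 1, 0): "G", ("A", 1, 1): "G",
    ("L", 0, 0): "A", ("L", 0, 1): "L", ("L", 1, 0): "A", ("L", 1, 1): "L",
    ("G", 0, 0): "A", ("G", 0, 1): "L", ("G", 1, 0): "G", ("G", 1, 1): "G",
}


def next_state(current, global_in, local):
    """Return the next state of a stage: 'A', 'L' or 'G'."""
    return NEXT_STATE[(current, int(global_in), int(local))]
```

One property visible in the table is that a stage in the L
state ignores global_in, so a stage that has just stalled
locally is not stalled a second time by a global signal
arriving one cycle later.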



A state transition diagram for the behaviour of each
pipeline stage is shown in Figure 6. In Figure 6,
transition numbers correspond to the numbers in the
first column of Table 1. As will be apparent to the
skilled person, the appropriate logic circuits may be
derived routinely from the state transition diagram
and/or table, and are accordingly not described
specifically herein.

In operation, register 140 holds instructions for
execution by processing circuit 142 and register 144
holds instructions for execution by processing circuit
146. If the phase of an instruction being executed in
clock cycle T by one of the processing circuits 142,
146 in Figure 5 requires more than one cycle for
execution, then that processing circuit asserts a stall
signal. The stall signals from stage i of both
pipelines are fed to OR gate 156. The output of OR
gate 156 is a signal stall i, which indicates whether
stage i of one or more of the pipelines in that cluster
has asserted a stall signal. The signal stall i is fed
to stall control logic 148.

Based on the stall i signal, the stall control logic
148 generates a signal local i, as follows:

local i = (stall i OR ripple i+1)

The signal ripple i is set whenever the next state of
stage i of the cluster will be L (i.e. locally
stalled). Hence:

ripple i = (NextState i = L)

The signal ripple i+1 is the corresponding signal
generated by stage i+1 of the cluster.

If ripple i is asserted during cycle T, then ripple
i-1, i-2, ..., i-k will also be asserted in the same
cycle as a consequence of the local stall in stage i.
The ripple-down of the localised stall signal will
terminate at the left-most end of the pipeline, or
earlier if some stage will not be in the L state in
cycle T+1 (i.e. NextState i-k ≠ L). This could be
caused by a localised stall in another cluster at stage
i-k-1 which was present in cycle T-1 and caused the
global_in signal at stage i-k to be asserted in cycle
T.
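
The interaction between local, ripple and NextState within
one cluster can be sketched by evaluating the stages from the
last stage down to stage 0. The function names and the
list-based representation are assumptions made for
illustration. The special handling of stage 0, which is
arranged to stall whenever stage 1 stalls, is deliberately
not modelled here.

```python
def next_state(current, global_in, local):
    """Next state of one stage, equivalent to the rows of Table 1."""
    if current == "L":
        # A locally stalled stage ignores global_in.
        return "L" if local else "A"
    # Current state is A or G: global_in takes priority over local.
    if global_in:
        return "G"
    return "L" if local else "A"


def cluster_next_states(current, stall, global_in):
    """Compute NextState for every stage of one cluster.

    current:   list of current states ('A', 'L' or 'G'), one per stage
    stall:     list of booleans, the stall i signal of each stage
    global_in: list of booleans, the global_in i signal of each stage
    """
    n = len(current)
    nxt = [None] * n
    ripple_from_next_stage = False  # ripple i+1; none beyond the last stage
    for i in range(n - 1, -1, -1):
        local = stall[i] or ripple_from_next_stage  # local i
        nxt[i] = next_state(current[i], global_in[i], local)
        ripple_from_next_stage = nxt[i] == "L"      # ripple i
    return nxt


# A stall at stage 3 ripples down to stage 0.
plain = cluster_next_states(["A"] * 5, [False, False, False, True, False],
                            [False] * 5)
# The same stall with global_in asserted at stages 1 and 2: the ripple
# terminates at stage 2, which enters the G state instead of L.
cut = cluster_next_states(["A"] * 5, [False, False, False, True, False],
                          [False, True, True, False, False])
```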

The NextState i signal is generated based on the local
i signal, the CurrentState i signal and the global_in i
signal, as shown in the final column of Table 1.
On the next active edge of the clock, the signal
NextState i is registered in register 150, so that the
signal CurrentState i in cycle T+1 is equal to the
signal NextState i in cycle T.

If the value of NextState i in any clock cycle is
either L or G (corresponding to locally or globally
stalled) then the hold i signal is asserted. In
response, the instructions in registers 140 and 144
remain the same in the next clock cycle.
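
The state update and the hold i signal may be sketched in
software as follows. This is a hypothetical reading only.
Of the cases coded here, the description names only
transitions 4 and 5 and the active-to-globally-stalled
case. The remaining cases (remaining in L while local stays
set, and applying the same rules to a stage currently in G)
are assumptions made to complete the sketch. The names
State, next_state and hold are introduced for illustration.

```python
from enum import Enum


class State(Enum):
    A = "active"
    L = "locally stalled"
    G = "globally stalled"


def next_state(current, local, global_in):
    """Hypothetical reading of Table 1 (the table itself is not reproduced)."""
    if current is State.L:
        # Transition 5 in the text: an L stage whose local signal is clear
        # becomes active whatever global_in is.
        # Assumption: the stage stays in L while local remains set.
        return State.L if local else State.A
    if global_in:
        # Outside L, an asserted global_in prevents the stage entering L;
        # treating the result as G in every such case is an assumption.
        return State.G
    # Transition 4 in the text: local set and global_in reset gives L.
    return State.L if local else State.A


def hold(current, local, global_in):
    """hold i is asserted when NextState i is L or G."""
    return next_state(current, local, global_in) in (State.L, State.G)
```

For instance, next_state(State.L, False, True) yields
State.A under this reading, so a stage leaving a local
stall is not stalled again by a global stall signal that
arrives in the same cycle.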

The various clusters communicate their stall statuses
to each other via the ripple and global_in signals. The
next_global_in i signal for cluster C is the logical OR
of the ripple i-1 signals from all clusters except C.

The next_global_in i signal is registered at the end of
every clock period in register 154 to give the signal
global_in i. The global_in signals for stage 0 of all
clusters are always false. In each pipeline, stage 0
is arranged to stall when stage 1 of that pipeline
stalls.

For example, if the phase of the instruction being
executed in cycle T by a processing circuit in stage h
(h ≥ i-1) of another cluster requires more than one
clock cycle, then its local h signal is asserted in
cycle T. This causes the next_global_in signals of all
stages 1 to h+1 in all other clusters to be asserted
before the end of cycle T. This in turn causes the
global_in i signal of stall control logic 148 to be
asserted in cycle T+1.
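
The combination of ripple signals between clusters may be
sketched as follows. This is an illustrative model only.
The function name next_global_in, the choice of four
clusters and six stages, and the treatment of stage 0 as
always False at the next_global_in level (rather than only
at the registered global_in level) are assumptions.

```python
def next_global_in(ripple, cluster):
    """Return next_global_in for every stage of one cluster.

    ripple[c][i] is the ripple i signal of cluster c in the current cycle.
    """
    stages = len(ripple[cluster])
    result = [False] * stages  # stage 0 stays False: its global_in is never set
    for i in range(1, stages):
        # Logical OR of ripple i-1 from all clusters except this one.
        result[i] = any(
            ripple[c][i - 1] for c in range(len(ripple)) if c != cluster
        )
    return result


# Four clusters of six stages (both counts assumed); cluster 1 ripples a
# local stall from stage 3 down to stage 0.
ripple = [[False] * 6 for _ in range(4)]
ripple[1][:4] = [True] * 4
seen_by_cluster_0 = next_global_in(ripple, 0)
seen_by_cluster_1 = next_global_in(ripple, 1)
```

Here a local stall at stage h = 3 in cluster 1 sets
next_global_in for stages 1 to h+1 = 4 of cluster 0, while
cluster 1 receives no global stall from its own ripple.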



The valid i signal is used to indicate that a bubble is
present in stage i of the cluster. For example, if
stage i-1 is stalled, then the signal valid_out i-1 is
set false, indicating that there is a bubble in stage
i-1. On the next clock cycle, the signal valid_out i-1
is registered by register 152 to give the signal valid
i. If valid i is set false, processing circuits 142,
146 ignore the instructions in registers 140, 144.

Thus it can be seen that the second embodiment provides
a distributed stalling scheme, in which each stage
determines locally whether it needs to stall the
processor at each cycle of program execution and
asserts stall i if a stall at stage i is required. If
the stall logic for stage i determines that any one of
the three causes of a stall requires that stage i be
stalled in the current cycle then it asserts the hold i
signal. In response, stage i retains the instruction
in its input register.

The second embodiment has the following main features:

1. Any pipeline stage in any cluster can assert a
   stall signal in any cycle that is not itself the
   subject of a stall.

2. Stages prior to a stalling stage that are in the
   same cluster will stall in the same cycle as the
   stalling stage, but stages in other pipelines will
   be stalled one cycle later.

3. When a stall is released, all stalled stages in
   that cluster make forward progress immediately,
   but stalled stages in other clusters are released
   one cycle later.

4. The logic to control the stalling of each stage is
   local to that stage. Where stall signals are
   communicated globally (from one cluster to
   another) a full clock cycle is available to
   compute the global stall condition and to
   distribute it to all clusters.

An example of the operation of the second embodiment
will now be described with reference to Figures 7 to
10. Figure 7 shows the situation where a stall occurs
in stage 3 on instruction packet 103 in cluster 1 at
time T=1. At this time cluster 1 stalls immediately
from stage 3 to stage 0 whereas other clusters are not
stalled. Figure 8 shows the situation one cycle later
at T=2. Because of the one-cycle delay to inform other
clusters that they should stall at stage 3, it is not
until this time that the other clusters are able to be
stalled. By that time instruction packet 103 has moved
on to stage 4, so the stall must take effect from stage
4.



Figure 9 illustrates the situation one cycle later at
time T=3 when the stall is released and cluster 1 is
free to make forward progress. In cluster 1,
instruction packets 103 to 106 move forward and packet
107 is inserted. Now the instruction packets in all
clusters are realigned due to the delay in releasing
clusters 0, 2 and 3 from their stalled state. As shown
in Figure 10, one cycle later at time T=4, all clusters
are released from the stall and progress continues as
normal.
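
The sequence of Figures 7 to 10 can be traced with a small
software model. The following is a sketch under stated
assumptions, not a description of the circuit: the
pipelines are taken to have six stages, the packets
initially in flight are numbered 101 to 106, the hold sets
for each clock edge are supplied by hand rather than
derived from stall logic, and the names step and run are
introduced for illustration.

```python
def step(pipe, held, next_packet):
    """Advance one cluster across one clock edge.

    pipe[i] : packet number in stage i (None represents a bubble)
    held    : set of stage indices whose hold signal is asserted
    """
    new = list(pipe)
    for i in range(len(pipe) - 1, 0, -1):
        if i in held:
            continue  # a held stage retains its instruction
        # An unheld stage takes the packet from stage i-1, or a bubble
        # if stage i-1 is itself held.
        new[i] = None if (i - 1) in held else pipe[i - 1]
    if 0 not in held:
        new[0] = next_packet  # stage 0 accepts a newly issued packet
    return new


def run(pipe, holds, next_packet):
    """Return the pipeline contents at T=1, T=2, ... for one cluster."""
    history = [list(pipe)]
    for held in holds:
        pipe = step(pipe, held, next_packet)
        if 0 not in held:
            next_packet += 1
        history.append(pipe)
    return history


start = [106, 105, 104, 103, 102, 101]  # stages 0..5 at T=1 (assumed)
# Cluster 1 holds stages 0 to 3 at the first edge; another cluster holds
# stages 0 to 4 one edge later.
cluster1 = run(start, [{0, 1, 2, 3}, set(), set()], 107)
cluster0 = run(start, [set(), {0, 1, 2, 3, 4}, set()], 107)
```

In this model the two clusters differ at T=2, where packet
103 is in stage 3 of cluster 1 and in stage 4 of the other
cluster, and hold identical contents again at T=3 and T=4.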

Another example of the operation of the second
embodiment, in which a stall signal is generated by two
different clusters, will now be described with
reference to Figures 11 to 14. In this example, a
local stall signal is generated by both stage 3 in
cluster 1 and stage 1 in cluster 2 at time T=1.

Referring to Figure 11, at time T=1, stages 3 to 0 of
cluster 1 and stages 1 and 0 of cluster 2 have the
local signal set and the global_in signal reset, so
that, referring to Table 1 (transition 4), the next
state of these stages will be locally stalled (L).



One cycle later at time T=2 (Figure 12), stages 3 to 0
of cluster 1 and stages 1 and 0 of cluster 2 are in the
locally stalled state, while the other stages are in
the active state (A). Bubbles thus form in stage 4 of
cluster 1 and stage 2 of cluster 2. Also at time T=2,
stages 4 to 0 of clusters 0, 2 and 3, and stages 2 to 0
of cluster 1, have their global_in signals set, due to
the stalls that occurred in clusters 1 and 2 in the
previous cycle. In this example the stalls only last
for one cycle. Thus, referring to Table 1, the next
state of stages 4 to 0 of clusters 0 and 3 and stages 4
and 3 of cluster 2 will be globally stalled (G) while
the next state of stages 3 to 0 in cluster 1 and 1 to 0
in cluster 2 will be active (A).

At time T=3 (Figure 13) stages 4 to 0 of clusters 0 and
3 and stages 4 and 3 of cluster 2 are globally stalled.
Stages 1 to 0 of cluster 2 are active, and hence the
instructions that were in these stages move forward one
stage to fill in the bubble that occurred in stage 2 of
cluster 2 at time T=2. Since stages 3 and 4 of cluster
2 are stalled, a new bubble forms at stage 5 of cluster
2. Cluster 1 is not stalled.

At time T=4 (Figure 14) all clusters are released from 
the stall and progress continues as normal. 

It will be noted from the above that, if a stall signal
is generated in two different clusters in the same
clock cycle, this only results in one bubble occurring
in the pipelines. This is because, if a stage has its
current state as locally stalled (L), and the local
signal is not set, the next state of that stage is
active (A), regardless of whether or not the global_in
signal is set. This is indicated by transition 5 in
Table 1 and Figure 6.

In the above examples, it is assumed that a stall is
released one clock cycle after it is initiated, so that
bubbles of one stage form. However, a stall may last
for an indefinite period of time, and thus bubbles of
two or more stages may form, depending on the number of
clock cycles for which a stall lasts.

As in the first embodiment, the delayed stalling scheme
of the second embodiment will not permit stalls to
originate in the final stage. This may be dealt with
either by arranging the processor and the instruction
set such that the last stage of a pipeline never needs
to assert a stall signal, or by adding a final dummy
stage to each pipeline, or both.

As in the first embodiment, the delay in distributing a 
global stall signal between clusters may be one or more 
clock cycles. 



Although the above description relates, by way of
example, to a VLIW processor it will be appreciated
that the present invention is applicable to processors
other than VLIW processors. A processor embodying the
present invention may be included as a processor "core"
in a highly-integrated "system-on-a-chip" (SOC) for use
in multimedia applications, network routers, video
mobile phones, intelligent automobiles, digital
television, voice recognition, 3D games, etc.

It will be understood that the present invention has
been described above purely by way of example, and
modifications of detail can be made within the scope of
the invention.

Each feature disclosed in the description, and (where
appropriate) the claims and drawings may be provided
independently or in any appropriate combination.



CLAIMS 



1. A processor comprising:
a plurality of pipelines, each pipeline having a
plurality of pipeline stages for executing an
instruction on successive clock cycles; and
stalling means for stalling the execution of
instructions in all of the pipelines in response to a
stall signal generated in any one of the pipelines;
wherein the stalling means is adapted to stall the
execution of instructions in different pipelines in
different clock cycles in response to the stall signal.

2. A processor according to claim 1 wherein the
stalling means is adapted to stall the execution of an
instruction in a pipeline not generating the stall
signal at least one clock cycle later than the
execution of an instruction in a pipeline generating
the stall signal.

3. A processor according to claim 2 wherein the
stalling means is adapted to release the stall in the
pipeline not generating the stall signal at least one
clock cycle later than the stall in the pipeline
generating the stall signal.

4. A processor according to claim 2 or 3 wherein
the stalling means is arranged such that, when a
pipeline stage in a first pipeline receives a stall
signal from a second pipeline, the execution of
instructions in the pipeline stage is not stalled if
the pipeline stage stalled in the previous cycle in
response to a stall signal generated by the first
pipeline.

5. A processor according to any of the preceding
claims wherein the stalling means is arranged such
that, when a pipeline generates a stall signal at a
stage i, all stages up to and including stage i of that
pipeline are stalled.

6. A processor according to any of the preceding
claims wherein, when a pipeline generates a stall
signal at a stage i, all stages up to and including
stage i of that pipeline are stalled on a given clock
cycle, and all stages up to and including stage i+m of
a pipeline not generating a stall signal are stalled m
clock cycles later than said given clock cycle, where m
is an integer greater than or equal to 1.

7. A processor according to any of the preceding
claims wherein the processor comprises a plurality of
pipeline clusters, each cluster comprising a plurality
of pipelines.

8. A processor according to claim 7 wherein the
stalling means is arranged to stall execution of
instructions in pipelines within a cluster in the same
clock cycle.



9. A processor according to claim 7 or 8 wherein 
the stalling means is arranged to stall the execution 
of instructions in pipelines in a cluster not 
generating the stall signal at least one clock cycle 
later than the execution of instructions in pipelines 
in a cluster generating the stall signal. 

10. A processor according to any of the preceding
claims wherein, in operation, instructions entering the
respective pipelines in parallel exit the pipelines in
parallel.

11. A processor comprising a plurality of
pipelines, each pipeline having a plurality of pipeline
stages for executing an instruction on successive clock
cycles, the processor being adapted to allow two
instructions which enter and exit respective pipelines
in parallel to become out of step with each other for
at least one clock cycle.

12. A method of operating a processor, the
processor comprising a plurality of pipelines, each
pipeline having a plurality of pipeline stages for
executing instructions on successive clock cycles, the
method comprising generating a stall signal in one of
the pipelines, and stalling the execution of
instructions in different pipelines in different clock
cycles in response to the stall signal.

13. A method according to claim 12 wherein the
execution of an instruction in a pipeline not
generating the stall signal is stalled at least one
clock cycle later than the execution of an instruction
in a pipeline generating the stall signal.

14. A method according to claim 13 wherein the
stall in the pipeline not generating the stall signal
is released at least one clock cycle later than the
stall in the pipeline generating the stall signal.

15. A method according to claim 13 or 14 wherein,
when a pipeline stage in a first pipeline receives a
stall signal from a second pipeline, the execution of
instructions in the pipeline stage is not stalled if
the pipeline stage stalled in the previous cycle in
response to a stall signal generated by the first
pipeline.

16. A method according to any of claims 12 to 15
wherein, when a pipeline generates a stall signal at
stage i, all stages up to and including stage i of that
pipeline are stalled.

17. A method according to any of claims 12 to 16
wherein, when a pipeline generates a stall signal at
stage i, all stages up to stage i of that pipeline are
stalled on a given clock cycle, and all stages up to
stage i+m of a pipeline not generating a stall signal
are stalled m clock cycles later than said given clock
cycle, where m is an integer greater than or equal to
1.

18. A method according to any of claims 12 to 17
wherein the processor comprises a plurality of pipeline
clusters, each cluster comprising a plurality of
pipelines, and instructions in pipelines within a
cluster are stalled in the same clock cycle.

19. A method according to claim 18 wherein the
execution of instructions in pipelines in a cluster not
generating the stall signal is stalled at least one
clock cycle later than the execution of instructions in
pipelines in a cluster generating the stall signal.

20. A method according to any of claims 12 to 19
wherein instructions entering the respective pipelines
in parallel exit the pipelines in parallel.

21. A method substantially as described with
reference to Figures 1 to 14 of the accompanying
drawings.

22. Apparatus substantially as described with
reference to and as illustrated in the accompanying
drawings.



ABSTRACT

PROCESSORS AND METHODS OF OPERATING PROCESSORS 

Processors comprising a plurality of pipelines are
disclosed, each pipeline having a plurality of pipeline
stages (140) for executing an instruction on successive
clock cycles. The processors allow an instruction in
one pipeline to become temporarily out of step with an
instruction in another pipeline. This may allow time
for a global signal, such as a global stall signal, to
be distributed.



(Figure 5) 